Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vulkan: experimental coalesced read to shared memory before dequantization #10999

Closed

Conversation

netrunnereve
Copy link
Collaborator

This is an ugly and hacky experiment that's kept as a draft for a reason, but it makes Q6_K run almost as fast as Q4_K_M. Basically the idea is to first read the superblocks into shared memory using perfectly coalesced loads and then do the actual dequantizations using the fast shared memory. As this completely separates the memory reads from the dequantization code this also allows us to explore different algorithms for dequantization that don't rely on coalescing.

With any other language I'd be able to easily do this, but GLSL has no pointers, memcpy, and all that. For some reason though it lets me index an array of structs beyond it's stated size to access the next element, and that doesn't cause an error or crash my computer. As a result this is not suitable for release unless we have a way of doing this properly, and I'm just leaving this here for feedback and suggestions.

It's possible to make this go faster by using 32-bit loads per thread if the alignment allows for it, but again it's the language that's holding me back. The current code runs with a fixed 64 subgroup size.

model size params backend ngl threads main_gpu sm test t/s
llama 7B Q6_K (Master) 5.53 GiB 7.24 B Vulkan 100 8 1 none tg128 20.39 ± 0.00
llama 7B Q6_K (PR) 5.53 GiB 7.24 B Vulkan 100 8 1 none tg128 25.15 ± 0.02
Master
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   368.04 us/run - 117.44 MFLOP/run - 319.10 GFLOPS

PR
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   294.77 us/run - 117.44 MFLOP/run - 398.41 GFLOPS

In 158ab15 I also have a proper version of this which runs at 23 t/s with the same Q6_K model. Here the loads are coalesced for each 16 thread superblock only and no hacks are required, but I think it only helps with AMD GCN as it's 64 wide SIMD is actually 4 16 wide units internally (and the loads are done in blocks of 16 threads).

@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 27, 2024
@netrunnereve
Copy link
Collaborator Author

Turns out the CI is failing as llvmpipe isn't happy with this change, though strangely enough it provides good output and tests clean with RADV.

GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

@jeffbolznv
Copy link
Collaborator

On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master and commit 158ab15 is maybe half a percent faster than master.

GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it. IIRC the syntax is just:

shared Block1 {
   // members, optionally with offset decorations
};
shared Block2 {
   // members, optionally with offset decorations
};

and then data in each block can potentially alias.

@netrunnereve
Copy link
Collaborator Author

On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master

This might actually be fine considering how I've forced it to use a subgroup size of 64. It should go faster with 32, but then again if you're memory bound this may not help much.

Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it.

Thanks. I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index fab4ff5f..dfcae382 100644
@@ -1,6 +1,7 @@
 #version 450
 
 #extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shared_memory_block : require
 
 #include "mul_mat_vec_base.comp"
 
@@ -9,7 +10,9 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
 layout (constant_id = 0) const uint BLOCK_SIZE = 32;
 layout (constant_id = 1) const uint NUM_ROWS = 1;
 
-shared FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+shared shblk {
+	FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+};
 
 void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
     uint a_offset, b_offset, d_offset;

@jeffbolznv
Copy link
Collaborator

I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

I looked at the SPIR-V that gets generated and I think what's happening is the ArrayStride for the inner array of tmpsh is being set to 128. This is a known limitation of SPIR-V where structure layouts that depend on spec constants don't really work because the layout doesn't get updated based on the spec constant. If you had BLOCK_SIZE = 64 in the shader it would probably work on AMD.

Keep in mind that GL_EXT_shared_memory_block is probably not supported everywhere, so we'd need to have a fallback path. So need to decide whether the perf we get from it is worth the additional maintenance cost.

@0cc4m
Copy link
Collaborator

0cc4m commented Dec 29, 2024

We would definitely need a fallback path. I think all modern desktop devices support the required extension VK_KHR_workgroup_memory_explicit_layout, but many phones don't and some people also use Vulkan on older GPU/driver combinations on desktop.

Edit: You can check how broad the support is on https://vulkan.gpuinfo.org/listextensions.php

This reverts commit ee88a8999f517f60d9f079b7781fadf623cb1f72.

Revert "data b cache example, slower than original"

This reverts commit 97ec5a01403028f0b9c7eeaac858cb9652ff958d.
@netrunnereve
Copy link
Collaborator Author

netrunnereve commented Dec 30, 2024

Keep in mind that GL_EXT_shared_memory_block is probably not supported everywhere, so we'd need to have a fallback path. So need to decide whether the perf we get from it is worth the additional maintenance cost.

We would definitely need a fallback path. I think all modern desktop devices support the required extension VK_KHR_workgroup_memory_explicit_layout, but many phones don't and some people also use Vulkan on older GPU/driver combinations on desktop.

In that case it's definitely not worth it unless it's a huge improvement. Actually regular layout aliasing is working well for me and I'm currently getting 24.6 t/s using a modified block_q6_K_packed16 structure with no hacks required.

I also tried the same caching method with the B vector and it made things much slower. Now if subgroup operators actually worked properly (wink wink Intel) it would also be worth testing subgroup shuffles to see how much faster it is compared to shared memory.

@0cc4m
Copy link
Collaborator

0cc4m commented Dec 30, 2024

Now if subgroup operators actually worked properly (wink wink Intel) it would also be worth testing subgroup shuffles to see how much faster it is compared to shared memory.

We could give that another try, the issue might have been subgroups that were not the right size or not fully enabled. We now support the extension to enforce full subgroups.

@netrunnereve
Copy link
Collaborator Author

Surprisingly enough subgroup shuffles ended up being a couple percent slower than shared memory, both for my Q6_K experiment and applied to the IQ4_NL LUT. Considering how close the IQ4_NL and Q4_0 results are it's safe to say that the on chip shared memory is really fast, but I didn't expect it to be faster than a direct register shuffle.

Anyways I'm closing this PR as I'll be busy with work starting next week and the performance improvements here just aren't big enough to be worth all this effort. I'll deliver up the non shared memory related fixes for Q6_K in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants